This code ‘capsule’ accompanies the course entitled: DIYtranscriptomics, a semester-long course covering best practices for RNAseq data analysis, with a primary focus on empowering students to be independent in the use of lightweight and open-source software and the R/bioconductor environment. This course is currently offered as CAMB714 at the University of Pennsylvania. This capsule contains all the code, data and software used in the course, allowing any instructor with an internet connection to reproduce the entire course.
The dataset used in this capsule is from the manuscript “Variable gene expression and parasite load predict treatment outcome in cutaneous leishmaniasis”. A link to the raw fastq file data hosted on GEO is provided here. The dataset includes 21 RNA-seq samples of lesion biopsies from patients with cutanoues leishmaniasis and 7 RNA-seq samples from biopsies from the healthy skin of non-infected subjects. More information about the RNA-seq samples can be found in the manuscript.
A variety of R packages was used for this analysis. All graphics and data wrangling were handled using the tidyverse suite of packages. Typographic style for graphs was set using hrbrthemes. This reproducible and dynamic report summarizes the basic code and outputs (plots, tables, etc) produced during the course, and was created using Rmarkdown and the Knitr package. All packages used are available from the Comprehensive R Archive Network (CRAN), Bioconductor.org, or Github.
library(tidyverse)
library(reshape2)
library(tximport)
library(biomaRt)
library(ensembldb)
library(RColorBrewer)
library(genefilter)
library(edgeR)
library(matrixStats)
library(gplots)
library(limma)
library(heatmaply)
library(DT)
library(gt)
library(plotly)
library(hrbrthemes)
library(EnsDb.Hsapiens.v86)
library(cowplot)
library(gprofiler2)
library(GSEABase)Raw reads were mapped to the human reference transcriptome using Kallisto, version 0.45. The quality of raw reads, as well as the results of Kallisto mapping were assessed using fastqc and multiqc.
After read mapping with Kallisto, TxImport was used to read kallisto outputs into the R environment. Annotation data from Biomart was used to ‘collapse’ data from transcript-level to gene-level.
# copy and paste in the essential code from the step 1 script
targets <- read_tsv("/data/course_data/studydesign.txt")# read in your study design
path <- file.path("/data/course_data", targets$sample, "abundance.h5") # set file paths to your mapped data
Tx <- transcripts(EnsDb.Hsapiens.v86, columns=c("tx_id", "gene_name"))
Tx <- as_tibble(Tx)
Tx <- dplyr::rename(Tx, target_id = tx_id)
Tx <- dplyr::select(Tx, "target_id", "gene_name")
Txi_gene <- tximport(path,
type = "kallisto",
tx2gene = Tx,
txOut = FALSE, #determines whether your data represented at transcript or gene level
countsFromAbundance = "lengthScaledTPM",
ignoreTxVersion = TRUE)sampleLabels <- targets$sample
groups <- factor(targets$group)
myDGEList <- DGEList(Txi_gene$counts)
log2.cpm <- cpm(myDGEList, log=TRUE)
log2.cpm.df <- as_tibble(log2.cpm, rownames = "geneID")
colnames(log2.cpm.df) <- c("geneID", sampleLabels)
log2.cpm.df.melt <- melt(log2.cpm.df)
p1 <- ggplot(log2.cpm.df.melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
labs(y="log2 expression", x = "sample",
title="Log2 Counts per Million (CPM)",
subtitle="unfiltered, non-normalized",
caption=paste0("produced on ", Sys.time())) +
coord_flip()+
theme_bw()
cpm <- cpm(myDGEList)
min_sample = as.vector(table(groups))
num_classes = length(min_sample)
min_size = min_sample[order(min_sample,decreasing=FALSE)[1]]
keepers <- rowSums(cpm>1)>=min_size
myDGEList.filtered <- myDGEList[keepers,]
log2.cpm.filtered <- cpm(myDGEList.filtered, log=TRUE)
log2.cpm.filtered.df <- as_tibble(log2.cpm.filtered, rownames = "geneID")
colnames(log2.cpm.filtered.df) <- c("geneID", sampleLabels)
log2.cpm.filtered.df.melt <- melt(log2.cpm.filtered.df)
p2 <- ggplot(log2.cpm.filtered.df.melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
labs(y="log2 expression", x = "sample",
title="Log2 Counts per Million (CPM)",
subtitle="filtered, non-normalized",
caption=paste0("produced on ", Sys.time())) +
coord_flip()+
theme_bw()
myDGEList.filtered.norm <- calcNormFactors(myDGEList.filtered, method = "TMM")
log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE)
log2.cpm.filtered.norm.df <- as_tibble(log2.cpm.filtered.norm, rownames = "geneID")
colnames(log2.cpm.filtered.norm.df) <- c("geneID", sampleLabels)
log2.cpm.filtered.norm.df.melt <- melt(log2.cpm.filtered.norm.df)
p3 <- ggplot(log2.cpm.filtered.norm.df.melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
labs(y="log2 expression", x = "sample",
title="Log2 Counts per Million (CPM)",
subtitle="filtered, TMM normalized",
caption=paste0("produced on ", Sys.time())) +
coord_flip()+
theme_bw()
plot_grid(p1, p2, p3, labels = c('A', 'B', 'C'), label_size = 12)Filtering was carried out to remove lowly expressed genes, reducing the number of genes from 35630 to 16711.
The table shown below includes expression data for 16711 genes. You can sort and search the data directly from the table.
mydata.df <- as_tibble(log2.cpm.filtered.norm, rownames = "geneSymbol")
colnames(mydata.df) <- c("geneSymbol", sampleLabels)
mydata.df <- mutate(mydata.df,
healthy_controls.AVG = (host_HS01 + host_HS02 + host_HS03 + host_HS04 + host_HS05 + host_HS06 + host_HS07)/7,
CL_patients.AVG = (host_CL01 + host_CL02 + host_CL03 + host_CL04 + host_CL05 + host_CL06 + host_CL07 + host_CL08 + host_CL09 + host_CL10 + host_CL11 + host_CL12 + host_CL13 + host_CL14 + host_CL15 + host_CL16 + host_CL17 + host_CL18 + host_CL19 + host_CL20 + host_CL21)/21,
#now make columns comparing each of the averages above that you're interested in
LogFC.CL_patients_vs_healthy_controls = (CL_patients.AVG - healthy_controls.AVG)) %>%
mutate_if(is.numeric, round, 2) %>%
select(geneSymbol,healthy_controls.AVG,CL_patients.AVG,LogFC.CL_patients_vs_healthy_controls) %>%
arrange(desc(LogFC.CL_patients_vs_healthy_controls))
datatable(mydata.df,
extensions = c('KeyTable', "FixedHeader"),
caption = 'my cool table)',
options = list(keys = TRUE,
searchHighlight = TRUE,
pageLength = 10,
lengthMenu = c("10", "25", "50", "100")))pca.res <- prcomp(t(log2.cpm.filtered.norm), scale.=F, retx=T)
pc.var<-pca.res$sdev^2
pc.per<-round(pc.var/sum(pc.var)*100, 1)
pca.res.df <- as_tibble(pca.res$x)
pca.plot <- ggplot(pca.res.df, aes(x=PC1, y=PC2, color=targets$group)) +
geom_point(size=4) +
xlab(paste0("PC1 (",pc.per[1],"%",")")) +
ylab(paste0("PC2 (",pc.per[2],"%",")")) +
labs(title="PCA plot",
caption=paste0("produced on ", Sys.time())) +
theme_bw()
ggplotly(pca.plot)design <- model.matrix(~0 + groups)
colnames(design) <- levels(groups)
v.DEGList.filtered.norm <- voom(myDGEList.filtered.norm, design, plot = FALSE)
fit <- lmFit(v.DEGList.filtered.norm, design)
contrast.matrix <- makeContrasts(disease = cutaneous - control,
levels=design)
fits <- contrasts.fit(fit, contrast.matrix)
ebFit <- eBayes(fits)
myTopHits <- topTable(ebFit, adjust ="BH", coef=1, number=40000, sort.by="logFC")
myTopHits <- as_tibble(myTopHits, rownames = "geneSymbol")
ggplotly(ggplot(myTopHits, aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", geneSymbol))) +
geom_point(size=2) +
ylim(-0.5,12) +
geom_hline(yintercept = -log10(0.01), linetype="longdash", colour="grey", size=1) +
geom_vline(xintercept = 1, linetype="longdash", colour="#BE684D", size=1) +
geom_vline(xintercept = -1, linetype="longdash", colour="#2C467A", size=1) +
theme_bw())datatable(myTopHits,
extensions = c('KeyTable', "FixedHeader"),
caption = 'Table 1: Differential gene expression analysis for infected (cutaneous leishmaniasis) vs control',
options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
formatRound(columns=c(1:10), digits=2)To identify differentially expressed genes, precision weights were first applied to each gene based on its mean-variance relationship using VOOM, then data was normalized using the TMM method in EdgeR. Linear modeling and bayesian stats were employed via Limma to find genes that were up- or down-regulated in L. braziliensis-infected cells, compared to uninfected cells, by 2-fold or more, with a false-discovery rate (FDR) of 0.01.
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=2)
colnames(v.DEGList.filtered.norm$E) <- sampleLabels
diffGenes <- v.DEGList.filtered.norm$E[results[,1] !=0,]
diffGenes.df <- as_tibble(diffGenes, rownames = "geneSymbol")
datatable(diffGenes.df,
extensions = c('KeyTable', "FixedHeader"),
caption = 'Table 1: DEGs for infected (cutaneous leishmaniasis) vs control',
options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
formatRound(columns=c(1:length(targets$sample)+1), digits=2)Pearson correlation was used to cluster 2242 differentially expressed genes, which were then represented as heatmap with the data scaled by Zscore for each row.
myheatcolors2 <- colorRampPalette(colors=c("yellow","white","blue"))(100)
clustRows <- hclust(as.dist(1-cor(t(diffGenes), method="pearson")), method="complete") #cluster rows by pearson correlation
clustColumns <- hclust(as.dist(1-cor(diffGenes, method="spearman")), method="complete")
module.assign <- cutree(clustRows, k=2)
module.color <- rainbow(length(unique(module.assign)), start=0.1, end=0.9)
module.color <- module.color[as.vector(module.assign)]
#pick modules corresponding to up- and down-regulated genes
#choose a cluster(s) of interest by selecting the corresponding number based on the previous graph
module.up <- 2 #up regulated genes in the heatmap produced by heatmaply
module.down <- 1 #down regulated genes in the heatmap produced by heatmaply
module.up <- diffGenes[names(module.assign[module.assign %in% module.up]),]
module.down <- diffGenes[names(module.assign[module.assign %in% module.down]),]
heatmap.2(diffGenes, #or diffGenes
Rowv=as.dendrogram(clustRows),
Colv=as.dendrogram(clustColumns),
RowSideColors=module.color,
col=myheatcolors2, scale='row', labRow=NA,
density.info="none", trace="none",
cexRow=1, cexCol=1, margins=c(8,20)) GO enrichment for the 1290 genes upregulated in the skin of patients with cutaneous leishmaniasis
gost.res <- gost(rownames(module.up), organism = "hsapiens")
gostplot(gost.res)GO enrichment for the 952 genes downregulated in the skin of patients with cutaneous leishmaniasis
gost.res <- gost(rownames(module.down), organism = "hsapiens")
gostplot(gost.res)broadSet.C2.CP <- getGmt("../data/MSigDB/c2.cp.v7.0.symbols.gmt", geneIdType=SymbolIdentifier())
broadSet.C2.CP <- geneIds(broadSet.C2.CP)
GSEAres <- camera(v.DEGList.filtered.norm$E, broadSet.C2.CP, design, contrast.matrix[,1])
GSEAres <- as_tibble(GSEAres, rownames = "setName")
GSEAres <- filter(GSEAres, FDR<=0.05)
datatable(GSEAres,
extensions = c('KeyTable', "FixedHeader"),
caption = 'Signatures enriched in infected cells',
options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
formatRound(columns=c(1:length(targets$sample)+1), digits=3)Describe the results in your own words. Some things to think about:
The output from running ‘sessionInfo’ is shown below and details all packages and version necessary to reproduce the results in this report.
sessionInfo()## R version 3.6.0 (2019-04-26)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so
##
## locale:
## [1] en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GSEABase_1.48.0 graph_1.64.0
## [3] annotate_1.64.0 XML_3.98-1.19
## [5] gprofiler2_0.1.7 cowplot_1.0.0
## [7] EnsDb.Hsapiens.v86_2.99.0 hrbrthemes_0.7.2
## [9] gt_0.1.0 DT_0.5
## [11] heatmaply_0.16.0 viridis_0.5.1
## [13] viridisLite_0.3.0 plotly_4.9.0
## [15] gplots_3.0.1.1 matrixStats_0.54.0
## [17] edgeR_3.28.0 limma_3.42.0
## [19] genefilter_1.68.0 RColorBrewer_1.1-2
## [21] ensembldb_2.10.0 AnnotationFilter_1.10.0
## [23] GenomicFeatures_1.38.0 AnnotationDbi_1.48.0
## [25] Biobase_2.46.0 GenomicRanges_1.38.0
## [27] GenomeInfoDb_1.22.0 IRanges_2.20.0
## [29] S4Vectors_0.24.0 BiocGenerics_0.32.0
## [31] biomaRt_2.42.0 tximport_1.14.0
## [33] reshape2_1.4.3 forcats_0.4.0
## [35] stringr_1.4.0 dplyr_0.8.5
## [37] purrr_0.3.4 readr_1.3.1
## [39] tidyr_1.0.2 tibble_3.0.0
## [41] ggplot2_3.3.0 tidyverse_1.2.1
## [43] knitr_1.28 rmarkdown_2.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.6
## [3] systemfonts_0.2.0 BiocFileCache_1.10.0
## [5] plyr_1.8.4 lazyeval_0.2.2
## [7] splines_3.6.0 crosstalk_1.0.0
## [9] BiocParallel_1.20.0 digest_0.6.25
## [11] foreach_1.4.4 htmltools_0.4.0
## [13] gdata_2.18.0 fansi_0.4.1
## [15] checkmate_2.0.0 magrittr_1.5
## [17] memoise_1.1.0 cluster_2.0.8
## [19] gclus_1.3.2 Biostrings_2.54.0
## [21] extrafont_0.17 modelr_0.1.4
## [23] extrafontdb_1.0 askpass_1.0
## [25] prettyunits_1.1.1 colorspace_1.4-1
## [27] blob_1.1.1 rvest_0.3.3
## [29] rappdirs_0.3.1 haven_2.1.0
## [31] xfun_0.13 crayon_1.3.4
## [33] RCurl_1.95-4.12 jsonlite_1.6.1
## [35] survival_2.44-1.1 iterators_1.0.10
## [37] glue_1.4.0 registry_0.5-1
## [39] gtable_0.3.0 zlibbioc_1.32.0
## [41] XVector_0.26.0 webshot_0.5.1
## [43] DelayedArray_0.12.0 Rhdf5lib_1.8.0
## [45] Rttf2pt1_1.3.8 scales_1.1.0
## [47] DBI_1.0.0 Rcpp_1.0.4.6
## [49] xtable_1.8-4 progress_1.2.0
## [51] bit_1.1-14 htmlwidgets_1.3
## [53] httr_1.4.0 ellipsis_0.3.0
## [55] farver_2.0.3 pkgconfig_2.0.3
## [57] dbplyr_1.4.0 locfit_1.5-9.4
## [59] later_0.8.0 labeling_0.3
## [61] tidyselect_1.0.0 rlang_0.4.5
## [63] munsell_0.5.0 cellranger_1.1.0
## [65] tools_3.6.0 cli_2.0.2
## [67] generics_0.0.2 RSQLite_2.1.1
## [69] broom_0.5.2 evaluate_0.14
## [71] yaml_2.2.1 bit64_0.9-7
## [73] caTools_1.17.1.2 dendextend_1.13.4
## [75] nlme_3.1-147 mime_0.9
## [77] xml2_1.2.0 compiler_3.6.0
## [79] rstudioapi_0.11 curl_3.3
## [81] stringi_1.4.6 gdtools_0.2.2
## [83] lattice_0.20-41 ProtGenerics_1.18.0
## [85] Matrix_1.2-18 vctrs_0.2.4
## [87] pillar_1.4.3 lifecycle_0.2.0
## [89] data.table_1.12.2 bitops_1.0-6
## [91] seriation_1.2-8 httpuv_1.5.1
## [93] rtracklayer_1.46.0 R6_2.4.1
## [95] promises_1.0.1 TSP_1.1-7
## [97] KernSmooth_2.23-15 gridExtra_2.3
## [99] codetools_0.2-16 MASS_7.3-51.5
## [101] gtools_3.8.1 assertthat_0.2.1
## [103] rhdf5_2.30.0 SummarizedExperiment_1.16.0
## [105] openssl_1.3 withr_2.1.2
## [107] GenomicAlignments_1.22.0 Rsamtools_2.2.0
## [109] GenomeInfoDbData_1.2.2 hms_0.4.2
## [111] grid_3.6.0 shiny_1.3.2
## [113] lubridate_1.7.4